Abstract

We consider the Minimum Description Length principle for online sequence 
prediction. If the underlying model class is discrete, then the total expected 
square loss is a particularly interesting performance measure: (a) this quantity
is bounded, implying convergence with probability one, and (b) it additionally
specifies a rate of convergence. Generally, for MDL only exponential loss
bounds hold, as opposed to the linear bounds for a Bayes mixture. We show
that this is even the case if the model class contains only Bernoulli distributions.
We derive a new upper bound on the prediction error for countable
Bernoulli classes. This implies a small bound (comparable to the one for
Bayes mixtures) for certain important model classes. The results apply to
many Machine Learning tasks including classification and hypothesis testing.
We provide arguments that our theorems generalize to countable classes of
i.i.d. models.

1 Introduction



"Bayes mixture", "Solomonoff induction", "marginalization" , all these terms refer 
to a central induction principle: Obtain a predictive distribution by integrating 
the product of prior and evidence over the model class. In many cases however, 
the Bayes mixture cannot be computed, and even a sophisticated approximation is 
expensive. The MDL or MAP (maximum a posteriori) estimator is both a common 
approximation for the Bayes mixture and interesting for its own sake: Use the model 
with the largest product of prior and evidence. (In practice, the MDL estimator 
is usually being approximated too, in particular when only a local maximum is 
determined.) 

How good are the predictions by Bayes mixtures and MDL? This question has 
attracted much attention. In the context of prediction, arguably the most important 
quality measure is the total or cumulative expected loss of a predictor. A very 
common choice of loss function is the square loss. Throughout this paper, we will 
study this quantity in an online setup. 

Assume that the outcome space is finite, and the model class is continuously
parameterized. Then for Bayes mixture prediction, the cumulative expected square
loss is usually small but unbounded, growing with $\log n$, where $n$ is the sample
size [CB90]. This corresponds to an instantaneous loss bound of $\frac{1}{n}$. For the MDL
predictor, the losses behave similarly [Ris96, BRY98] under appropriate conditions,
in particular with a specific prior. (Note that in order to do MDL for continuous
model classes, one needs to discretize the parameter space, e.g. [BC91].)

On the other hand, if the model class is discrete, then Solomonoff's theorem
[Sol78, Hut01] bounds the cumulative expected square loss for the Bayes mixture
predictions finitely, namely by $\ln w_\mu^{-1}$, where $w_\mu$ is the prior weight of the "true"
model $\mu$. The only necessary assumption is that the true distribution $\mu$ is contained
in the model class. For the corresponding MDL predictions, we have shown [PH04]
that a bound of $w_\mu^{-1}$ holds. This is exponentially larger than the Solomonoff bound,
and it is sharp in general. A finite bound on the total expected square loss is
particularly interesting:

1. It implies convergence of the predictive to the true probabilities with probability
one. In contrast, an instantaneous loss bound which tends to zero implies
only convergence in probability.

2. Additionally, it gives a convergence speed, in the sense that errors of a certain 
magnitude cannot occur too often. 

So for both, Bayes mixtures and MDL, convergence with probability one holds, 
while the convergence rate is exponentially worse for MDL compared to the Bayes 
mixture. 

It is therefore natural to ask if there are model classes where the cumulative loss
of MDL is comparable to that of Bayes mixture predictions. Here we will concentrate
on the simplest possible stochastic case, namely discrete Bernoulli classes (compare
also [Vov97]). It might be surprising to discover that in general the cumulative loss
is still exponential. On the other hand, we will give mild conditions on the prior
guaranteeing a small bound. We will provide arguments that these results generalize
to arbitrary i.i.d. classes. Moreover, we will see that the instantaneous (as opposed
to the cumulative) bounds are always small ($\approx \frac{1}{n}$). This corresponds to the well-known
fact that the instantaneous square loss of the Maximum Likelihood estimator
decays as $\frac{1}{n}$ in the Bernoulli case.

A particular motivation to consider discrete model classes arises in Algorithmic 
Information Theory. From a computational point of view, the largest relevant model 
class is the countable class of all computable models (isomorphic to programs) on 
some fixed universal Turing machine. We may study the corresponding Bernoulli 
case and consider the countable set of computable reals in [0, 1]. We call this the 
universal setup. The description length $K(\vartheta)$ of a parameter $\vartheta \in [0, 1]$ is then given
by the length of the shortest program that outputs $\vartheta$, and a prior weight may be
defined by $2^{-K(\vartheta)}$.

Many Machine Learning tasks are or can be reduced to sequence prediction tasks. 
An important example is classification. The task of classifying a new instance $z_n$
after having seen (instance, class) pairs $(z_1, c_1), \ldots, (z_{n-1}, c_{n-1})$ can be phrased as
predicting the continuation of the sequence $z_1 c_1 \ldots z_{n-1} c_{n-1} z_n$. Typically the
(instance, class) pairs are i.i.d.

Our main tool for obtaining results is the Kullback-Leibler divergence. Lemmata 
for this quantity are stated in Section 2. Section 3 shows that the exponential error 
bound obtained in [PH04] is sharp in general. In Section 4, we give an upper bound 
on the instantaneous and the cumulative losses. The latter bound is small, e.g., under
certain conditions on the distribution of the weights; this is the subject of Section
5. Section 6 treats the universal setup. Finally, in Section 7 we discuss the results
and give conclusions. 



2 Kullback-Leibler Divergence 

Let $\mathbb{B} = \{0, 1\}$ and consider finite strings $x \in \mathbb{B}^*$ as well as infinite sequences
$x_{<\infty} \in \mathbb{B}^\infty$, with the first $n$ bits denoted by $x_{1:n}$. If we know that $x$ is generated
by an i.i.d. random variable, then $P(x_i = 1) = \vartheta_0$ for all $1 \le i \le \ell(x)$, where $\ell(x)$
is the length of $x$. Then $x$ is called a Bernoulli sequence, and $\vartheta_0 \in \Theta \subset [0, 1]$ the
true parameter. In the following we will consider only countable $\Theta$, e.g. the set of
all computable numbers in $[0, 1]$.

Associated with each $\vartheta \in \Theta$, there is a complexity or description length $Kw(\vartheta)$
and a weight or (semi)probability $w_\vartheta = 2^{-Kw(\vartheta)}$. The complexity will often but need
not be a natural number. Typically, one assumes that the weights sum up to at
most one, $\sum_{\vartheta \in \Theta} w_\vartheta \le 1$. Then, by the Kraft inequality, for all $\vartheta \in \Theta$ there exists a
prefix-code of length $Kw(\vartheta)$. Because of this correspondence, it is only a matter of
convenience if results are developed in terms of description lengths or probabilities.
We will choose the former way. We won't even need the condition $\sum_\vartheta w_\vartheta \le 1$ for
most of the following results. This only means that $Kw$ cannot be interpreted as a
prefix code length, but does not cause other problems.

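Given a set of distributions $\Theta \subset [0, 1]$, complexities $(Kw(\vartheta))_{\vartheta \in \Theta}$, a true distribution
$\vartheta_0 \in \Theta$, and some observed string $x \in \mathbb{B}^*$, we define an MDL estimator$^1$:

$$\vartheta^x = \arg\max_{\vartheta \in \Theta} \{ w_\vartheta P(x \mid \vartheta_0 = \vartheta) \}.$$

Here, $P(x \mid \vartheta_0 = \vartheta)$ is the probability of observing $x$ if $\vartheta$ is the true parameter.
Clearly, $P(x \mid \vartheta_0 = \vartheta) = \vartheta^{\mathbb{I}(x)} (1 - \vartheta)^{\ell(x) - \mathbb{I}(x)}$, where $\mathbb{I}(x)$ is the number of ones in $x$.
Hence $P(x \mid \vartheta_0 = \vartheta)$ depends only on $\ell(x)$ and $\mathbb{I}(x)$. We therefore see

$$\vartheta^x = \vartheta^{(\alpha,n)} = \arg\max_{\vartheta \in \Theta} \{ w_\vartheta (\vartheta^\alpha (1 - \vartheta)^{1-\alpha})^n \} \qquad (1)$$

$$= \arg\min_{\vartheta \in \Theta} \{ n \cdot D(\alpha \| \vartheta) + Kw(\vartheta) \cdot \ln 2 \},$$

where $n = \ell(x)$ and $\alpha := \frac{\mathbb{I}(x)}{\ell(x)}$ is the observed fraction of ones and

$$D(\alpha \| \vartheta) = \alpha \ln \frac{\alpha}{\vartheta} + (1 - \alpha) \ln \frac{1 - \alpha}{1 - \vartheta}$$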

is the Kullback-Leibler divergence. Let $\vartheta, \tilde\vartheta \in \Theta$ be two parameters, then it follows
from (1) that in the process of choosing the MDL estimator, $\vartheta$ is being preferred to
$\tilde\vartheta$ iff

$$n (D(\alpha \| \tilde\vartheta) - D(\alpha \| \vartheta)) \ge \ln 2 \cdot (Kw(\vartheta) - Kw(\tilde\vartheta)). \qquad (2)$$

In this case, we say that $\vartheta$ beats $\tilde\vartheta$. It is immediate that for increasing $n$ the influence
of the complexities on the selection of the maximizing element decreases. We are now
interested in the total expected square prediction error (or cumulative square loss) of
the MDL estimator, $\sum_{n=1}^{\infty} E(\vartheta^{x_{1:n}} - \vartheta_0)^2$. In terms of [PH04], this is the static MDL
prediction loss, which means that a predictor/estimator $\vartheta^x$ is chosen according to
the current observation $x$. The dynamic method on the other hand would consider
both possible continuations $x0$ and $x1$ and predict according to $\vartheta^{x0}$ and $\vartheta^{x1}$. In the
following, we concentrate on static predictions. They are also preferred in practice,
since computing only one model is more efficient.
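The estimator in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model class is finite and given as a dictionary from parameters in (0, 1) to complexities, and the names `kl_bernoulli`, `mdl_estimate` and `toy_class` are hypothetical, not taken from the paper.

```python
import math

def kl_bernoulli(a: float, t: float) -> float:
    """D(a || t) in nats, with the convention 0 * ln 0 = 0; requires 0 < t < 1."""
    d = 0.0
    if a > 0.0:
        d += a * math.log(a / t)
    if a < 1.0:
        d += (1.0 - a) * math.log((1.0 - a) / (1.0 - t))
    return d

def mdl_estimate(ones: int, n: int, model_class: dict[float, float]) -> float:
    """Return the parameter minimising n * D(alpha || theta) + Kw(theta) * ln 2.

    `model_class` maps each parameter theta in (0, 1) to its complexity Kw(theta).
    """
    alpha = ones / n  # observed fraction of ones
    return min(
        model_class,
        key=lambda t: n * kl_bernoulli(alpha, t) + model_class[t] * math.log(2),
    )

# Hypothetical toy class: the parameter 1/2 is cheap, the others cost more bits.
toy_class = {0.5: 1.0, 0.75: 3.0, 0.9: 3.0}
print(mdl_estimate(3, 4, toy_class))     # few samples: the cheap parameter wins
print(mdl_estimate(75, 100, toy_class))  # many samples: the data dominate
```

With the same observed fraction 3/4, the estimate moves from 1/2 to 3/4 as n grows, which illustrates how the influence of the complexities decreases with the sample size.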

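Let $A_n = \{\frac{k}{n} : 0 \le k \le n\}$. Given the true parameter $\vartheta_0$ and some $n \in \mathbb{N}$, the
expectation of a function $f^{(n)} : \{0, \ldots, n\} \to \mathbb{R}$ is given by

$$E f^{(n)} = \sum_{\alpha \in A_n} p(\alpha|n) f(\alpha n), \quad \text{where} \quad p(\alpha|n) = \binom{n}{\alpha n} (\vartheta_0^\alpha (1 - \vartheta_0)^{1-\alpha})^n. \qquad (3)$$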

1 Precisely, we define a MAP (maximum a posteriori) estimator. For two reasons, our definition
might not be considered as MDL in the strict sense. First, MDL is often associated with a specific
prior, while we admit arbitrary priors. Second and more importantly, when coding some data $x$,
one can exploit the fact that once the parameter $\vartheta^x$ is specified, only data which leads to this
$\vartheta^x$ needs to be considered. This allows for a description shorter than $Kw(\vartheta^x)$. Nevertheless, the
construction principle is commonly termed MDL, compare e.g. the "ideal MDL" in [VL00].

(Note that the probability $p(\alpha|n)$ depends on $\vartheta_0$, which we do not make explicit in
our notation.) Therefore,

$$\sum_{n=1}^{\infty} E(\vartheta^{x_{1:n}} - \vartheta_0)^2 = \sum_{n=1}^{\infty} \sum_{\alpha \in A_n} p(\alpha|n) (\vartheta^{(\alpha,n)} - \vartheta_0)^2. \qquad (4)$$

Denote the relation $f = O(g)$ by $f \stackrel{\times}{\le} g$. Analogously define "$\stackrel{\times}{\ge}$" and "$\stackrel{\times}{=}$". From
[PH04, Corollary 12], we immediately obtain the following result.

Theorem 1 The cumulative loss bound $\sum_n E(\vartheta^{x_{1:n}} - \vartheta_0)^2 \stackrel{\times}{\le} 2^{Kw(\vartheta_0)}$ holds.

This is the "slow" convergence result mentioned in the introduction. In contrast,
for a Bayes mixture, the total expected error is bounded by $Kw(\vartheta_0)$ rather
than $2^{Kw(\vartheta_0)}$ (see [Sol78] or [Hut01, Th.1]). An upper bound on $\sum_n E(\vartheta^{x_{1:n}} - \vartheta_0)^2$
is termed convergence in mean sum and implies convergence $\vartheta^{x_{1:n}} \to \vartheta_0$ with
probability 1 (since otherwise the sum would be infinite).

We now establish relations between the Kullback-Leibler divergence and the 
quadratic distance. We call bounds of this type entropy inequalities. 

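Lemma 2 Let $\vartheta, \tilde\vartheta \in (0, 1)$ and $\vartheta^* = \arg\min\{|\vartheta - \frac12|, |\tilde\vartheta - \frac12|\}$, i.e. $\vartheta^*$ is the element
from $\{\vartheta, \tilde\vartheta\}$ which is closer to $\frac12$. Then

$$2 (\vartheta - \tilde\vartheta)^2 \stackrel{(i)}{\le} D(\vartheta \| \tilde\vartheta) \stackrel{(ii)}{\le} \tfrac{8}{3} (\vartheta - \tilde\vartheta)^2 \quad \text{and}$$

$$\frac{(\vartheta - \tilde\vartheta)^2}{2 \vartheta^* (1 - \vartheta^*)} \stackrel{(iii)}{\le} D(\vartheta \| \tilde\vartheta) \stackrel{(iv)}{\le} \frac{3 (\vartheta - \tilde\vartheta)^2}{2 \vartheta^* (1 - \vartheta^*)}.$$

Thereby, (ii) requires $\vartheta, \tilde\vartheta \in [\frac14, \frac34]$, (iii) requires $\vartheta, \tilde\vartheta \le \frac12$, and (iv) requires $\vartheta \le \frac14$
and $\tilde\vartheta \in [\frac{\vartheta}{3}, 3\vartheta]$. Statements (iii) and (iv) have symmetric counterparts for $\vartheta \ge \frac12$.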



Proof. The lower bound (i) is standard, see e.g. [LV97, p. 329]. In order to verify
the upper bound (ii), let $f(\eta) = D(\vartheta \| \eta) - \frac83 (\eta - \vartheta)^2$. Then (ii) follows from $f(\eta) \le 0$
for $\eta \in [\frac14, \frac34]$. We have that $f(\vartheta) = 0$ and $f'(\eta) = \frac{\eta - \vartheta}{\eta (1 - \eta)} - \frac{16}{3} (\eta - \vartheta)$. This difference
is nonnegative if and only if $\eta - \vartheta \le 0$, since $\eta (1 - \eta) \ge \frac{3}{16}$. This implies $f(\eta) \le 0$.
Statements (iii) and (iv), giving bounds if $\vartheta$ is close to the boundary, are proven
similarly. □
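As a quick numerical sanity check of the entropy inequalities, the following sketch evaluates both sides on a grid. It assumes the reading of Lemma 2 in which the lower constant is 2 and the upper constant is 8/3 for both arguments in [1/4, 3/4] (the value consistent with the condition η(1 − η) ≥ 3/16 used in the proof); the function names are hypothetical.

```python
import math

def kl_bernoulli(a: float, t: float) -> float:
    """D(a || t) in nats for 0 < a, t < 1."""
    return a * math.log(a / t) + (1 - a) * math.log((1 - a) / (1 - t))

def check_entropy_inequalities(steps: int = 200, tol: float = 1e-12) -> bool:
    """Check 2 (a - t)^2 <= D(a || t) <= 8/3 (a - t)^2 on a grid over [1/4, 3/4]^2."""
    grid = [0.25 + 0.5 * i / steps for i in range(steps + 1)]
    for a in grid:
        for t in grid:
            sq = (a - t) ** 2
            d = kl_bernoulli(a, t)
            # tol absorbs floating-point rounding near a == t
            if not (2 * sq - tol <= d <= 8 / 3 * sq + tol):
                return False
    return True

print(check_entropy_inequalities())
```

A grid check is of course no proof; it only guards against misreading the constants.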

Lemma 2 (ii) is sufficient to prove the lower bound on the error in Proposition
5. The bounds (iii) and (iv) are only needed in the technical proof of the upper
bound in Theorem 8, which will be omitted. It requires also similar upper and lower
bounds for the absolute distance, and if the second argument of $D(\cdot \| \cdot)$ tends to
the boundary. The lemma remains valid for the extreme cases $\vartheta, \tilde\vartheta \in \{0, 1\}$ if the
fraction $\frac{0}{0}$ is properly defined. It is likely to generalize to arbitrary alphabets; for (i)
this is shown in [Hut01].

It is a well-known fact that the binomial distribution may be approximated by 
a Gaussian. Our next goal is to establish upper and lower bounds for the binomial 
distribution. Again we leave out the extreme cases. 

Lemma 3 Let $\vartheta_0 \in (0, 1)$ be the true parameter, $n \ge 2$ and $1 \le k \le n - 1$, and
$\alpha = \frac{k}{n}$. Then the following assertions hold.

$$(i) \quad p(\alpha|n) \le \frac{1}{\sqrt{2 \pi \alpha (1 - \alpha) n}} \exp(-n D(\alpha \| \vartheta_0)),$$

$$(ii) \quad p(\alpha|n) \ge \frac{1}{\sqrt{8 \alpha (1 - \alpha) n}} \exp(-n D(\alpha \| \vartheta_0)).$$

The lemma is verified using Stirling's formula. The upper bound is sharp for
$n \to \infty$ and fixed $\alpha$. Lemma 3 can be easily combined with Lemma 2, yielding
Gaussian estimates for the Binomial distribution. The following lemma is proved
by simply estimating the sums by appropriate integrals.



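Lemma 4 Let $z \in \mathbb{R}^+$, then

$$(i) \quad \frac{\sqrt{\pi}}{2 z^3} - \frac{1}{z \sqrt{2e}} \le \sum_{n=1}^{\infty} \sqrt{n} \cdot \exp(-z^2 n) \le \frac{\sqrt{\pi}}{2 z^3} + \frac{1}{z \sqrt{2e}} \quad \text{and}$$

$$(ii) \quad \sum_{n=1}^{\infty} n^{-\frac12} \exp(-z^2 n) \le \frac{\sqrt{\pi}}{z}.$$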
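Both lemmas are easy to probe numerically. The sketch below assumes the constants $2\pi$ and 8 under the square roots in Lemma 3 and the bounds $\frac{\sqrt\pi}{2z^3} \pm \frac{1}{z\sqrt{2e}}$ and $\frac{\sqrt\pi}{z}$ in Lemma 4; the infinite series are truncated where the terms are negligible, and all function names are hypothetical.

```python
import math

def kl_bernoulli(a: float, t: float) -> float:
    """D(a || t) in nats for 0 < a, t < 1."""
    return a * math.log(a / t) + (1 - a) * math.log((1 - a) / (1 - t))

def lemma3_holds(theta0: float, n_max: int = 60, tol: float = 1e-9) -> bool:
    """Check both binomial bounds for all 2 <= n <= n_max and 1 <= k <= n - 1."""
    for n in range(2, n_max + 1):
        for k in range(1, n):
            a = k / n
            p = math.comb(n, k) * theta0 ** k * (1 - theta0) ** (n - k)
            core = math.exp(-n * kl_bernoulli(a, theta0)) / math.sqrt(a * (1 - a) * n)
            upper = core / math.sqrt(2 * math.pi)
            lower = core / math.sqrt(8)
            # relative tolerance: the lower bound is attained exactly at n = 2, k = 1
            if not (lower <= p * (1 + tol) and p <= upper * (1 + tol)):
                return False
    return True

def lemma4_holds(z: float) -> bool:
    """Check (i) and (ii) with the series cut off once exp(-z^2 n) < e^-60."""
    n_max = int(60 / z ** 2) + 1
    s_half = sum(math.sqrt(n) * math.exp(-z * z * n) for n in range(1, n_max + 1))
    s_inv = sum(math.exp(-z * z * n) / math.sqrt(n) for n in range(1, n_max + 1))
    centre = math.sqrt(math.pi) / (2 * z ** 3)
    slack = 1 / (z * math.sqrt(2 * math.e))
    return centre - slack <= s_half <= centre + slack and s_inv <= math.sqrt(math.pi) / z

print(lemma3_holds(0.3), lemma4_holds(0.1))
```

The slack term in (i) is the maximum of $\sqrt{x}\,e^{-z^2 x}$ over $x \ge 0$, which is what one expects when a sum over a unimodal function is compared with its integral.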



3 Lower Bound 

We are now in the position to prove that even for Bernoulli classes the upper bound 
from Theorem 1 is sharp in general. 

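Proposition 5 Let $\vartheta_0 = \frac12$ be the true parameter generating sequences of fair coin
flips. Assume there are $2^N - 1$ other parameters $\vartheta_1, \ldots, \vartheta_{2^N - 1}$ with $\vartheta_k = \frac12 + 2^{-k-1}$.
Let all complexities be equal, i.e. $Kw(\vartheta_0) = Kw(\vartheta_1) = \ldots = Kw(\vartheta_{2^N - 1}) = N$. Then

$$\sum_{n=1}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \ge \tfrac{1}{84} (2^N - 5) \stackrel{\times}{=} 2^{Kw(\vartheta_0)}.$$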



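Proof. Recall that $\vartheta^x = \vartheta^{(\alpha,n)}$, the maximizing element for some observed sequence
$x$, only depends on the length $n$ and the observed fraction of ones $\alpha$. In order
to obtain an estimate for the total prediction error $\sum_n E(\vartheta_0 - \vartheta^x)^2$, partition the
interval $[0, 1]$ into $2^N$ disjoint intervals $I_k$, such that $\bigcup_{k=0}^{2^N - 1} I_k = [0, 1]$. Then consider
the contributions for the observed fraction $\alpha$ falling in $I_k$ separately:

$$C(k) = \sum_{n=1}^{\infty} \sum_{\alpha \in A_n \cap I_k} p(\alpha|n) (\vartheta^{(\alpha,n)} - \vartheta_0)^2 \qquad (5)$$

(compare (3)). Clearly, $\sum_n E(\vartheta_0 - \vartheta^x)^2 = \sum_k C(k)$ holds. We define the partitioning
$(I_k)$ as $I_0 = [0, \frac12 + 2^{-2^N}) = [0, \vartheta_{2^N - 1})$, $I_1 = [\frac34, 1] = [\vartheta_1, 1]$, and

$$I_k = [\vartheta_k, \vartheta_{k-1}) \quad \text{for all } 2 \le k \le 2^N - 1.$$

Fix $k \in \{2, \ldots, 2^N - 1\}$ and assume $\alpha \in I_k$. Then

$$\vartheta^{(\alpha,n)} = \arg\min_\vartheta \{ n D(\alpha \| \vartheta) + Kw(\vartheta) \ln 2 \} = \arg\min_\vartheta \{ n D(\alpha \| \vartheta) \} \in \{\vartheta_k, \vartheta_{k-1}\}$$

according to (1). So clearly $(\vartheta^{(\alpha,n)} - \vartheta_0)^2 \ge (\vartheta_k - \vartheta_0)^2 = 2^{-2k-2}$ holds. Since $p(\alpha|n)$
decreases for increasing $|\alpha - \vartheta_0|$, we have $p(\alpha|n) \ge p(\vartheta_{k-1}|n)$. The interval $I_k$ has
length $2^{-k-1}$, so there are at least $\lfloor n 2^{-k-1} \rfloor \ge n 2^{-k-1} - 1$ observed fractions $\alpha$ falling
in the interval. From (5), the total contribution of $\alpha \in I_k$ can be estimated by

$$C(k) \ge \sum_{n=1}^{\infty} 2^{-2k-2} (n 2^{-k-1} - 1) \, p(\vartheta_{k-1}|n).$$

Note that the terms in the sum even become negative for small $n$, which does not
cause any problems. We proceed with

$$p(\vartheta_{k-1}|n) \ge \frac{1}{\sqrt{8 \cdot 2^{-2} n}} \exp[-n D(\tfrac12 + 2^{-k} \| \tfrac12)] \ge \frac{1}{\sqrt{2n}} \exp[-n \tfrac83 2^{-2k}]$$

according to Lemma 3 and Lemma 2 (ii). By Lemma 4 (i) and (ii), we have

$$\sum_{n=1}^{\infty} \sqrt{n} \exp[-n \tfrac83 2^{-2k}] \ge \frac{\sqrt{\pi}}{2} \left(\tfrac38\right)^{\frac32} 2^{3k} - \frac{1}{\sqrt{2e}} \sqrt{\tfrac38} \, 2^k \quad \text{and}$$

$$-\sum_{n=1}^{\infty} n^{-\frac12} \exp[-n \tfrac83 2^{-2k}] \ge -\sqrt{\tfrac{3\pi}{8}} \, 2^k.$$

Considering only $k \ge 5$, we thus obtain

$$C(k) \ge \frac{2^{-2k-2}}{\sqrt2} \left[ 2^{-k-1} \left( \frac{\sqrt{\pi}}{2} \left(\tfrac38\right)^{\frac32} 2^{3k} - \sqrt{\tfrac{3}{16e}} \, 2^k \right) - \sqrt{\tfrac{3\pi}{8}} \, 2^k \right]$$

$$= \frac{3 \sqrt{3\pi}}{512} - \frac{\sqrt{3\pi}}{16} 2^{-k} - \frac{1}{32} \sqrt{\tfrac{3}{2e}} \, 2^{-2k}$$

$$\ge \frac{3 \sqrt{3\pi}}{512} - \frac{\sqrt{3\pi}}{16} 2^{-5} - \frac{1}{32} \sqrt{\tfrac{3}{2e}} \, 2^{-10} > \frac{1}{84}.$$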



Ignoring the contributions for $k \le 4$, this implies the assertion. □

This result shows that if the parameters and their weights are chosen in an
appropriate way, then the total expected error is of order $w_0^{-1}$ instead of $\ln w_0^{-1}$.
Interestingly, this outcome seems to depend on the arrangement and the weights of
the false parameters rather than on the weight of the true one. One can check with
moderate effort that the proposition still remains valid if e.g. $w_0$ is twice as large as
the other weights. Actually, the proof of Proposition 5 shows even a slightly more
general result, namely the same bound holds when there are additional arbitrary
parameters with larger complexities. This will be used for Example 14. Other and
more general assertions can be proven similarly.
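The construction of Proposition 5 can be explored numerically for small $N$. The sketch below evaluates partial sums of the cumulative loss exactly via the binomial distribution, as in (3) and (4). It assumes the class $\vartheta_0 = \frac12$, $\vartheta_k = \frac12 + 2^{-k-1}$ with equal complexities, so that MDL reduces to maximum likelihood over the class; the function names and the truncation point `n_max` are hypothetical choices.

```python
import math

def prop5_class(N: int) -> list[float]:
    """theta_0 = 1/2 and theta_k = 1/2 + 2^(-k-1) for k = 1, ..., 2^N - 1."""
    return [0.5] + [0.5 + 2.0 ** (-k - 1) for k in range(1, 2 ** N)]

def ml_choice(k: int, n: int, thetas: list[float]) -> float:
    """With equal complexities the MDL estimator is the likelihood maximiser."""
    return max(thetas, key=lambda t: k * math.log(t) + (n - k) * math.log(1 - t))

def expected_sq_loss(n: int, thetas: list[float], theta0: float = 0.5) -> float:
    """E(theta^x - theta0)^2 for sample size n, summing over the number of ones k."""
    total = 0.0
    for k in range(n + 1):
        log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + k * math.log(theta0) + (n - k) * math.log(1 - theta0))
        total += math.exp(log_p) * (ml_choice(k, n, thetas) - theta0) ** 2
    return total

def cumulative_loss(N: int, n_max: int) -> float:
    """Partial sum of the cumulative loss; a lower bound on the infinite sum."""
    thetas = prop5_class(N)
    return sum(expected_sq_loss(n, thetas) for n in range(1, n_max + 1))

for N in (2, 3, 4):
    print(N, cumulative_loss(N, 200), (2 ** N - 5) / 84)
```

Since every term is nonnegative, any partial sum that already exceeds $\frac{1}{84}(2^N - 5)$ confirms the bound for that $N$; the parameters very close to $\frac12$ contribute only at large $n$, so a truncated sum understates the total.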



4 Upper Bounds 

Although the cumulative error may be large, as seen in the previous section, the 
instantaneous error is always small. 

Proposition 6 For $n \ge 3$, the expected instantaneous square loss is bounded:

$$E(\vartheta_0 - \vartheta^{x_{1:n}})^2 \le \frac{(\ln 2) Kw(\vartheta_0)}{2n} + \frac{\sqrt{2 (\ln 2) Kw(\vartheta_0) \ln n}}{n} + \frac{6 \ln n}{n}.$$

Proof. We give an elementary proof for the case $\vartheta_0 \in (\frac14, \frac34)$ only. Like in the
proof of Proposition 5, we consider the contributions of different $\alpha$ separately. By
Hoeffding's inequality, $P(|\alpha - \vartheta_0| \ge \frac{c}{\sqrt{n}}) \le 2 e^{-2c^2}$ for any $c > 0$. Letting $c = \sqrt{\ln n}$,
the contributions by these $\alpha$ are thus bounded by $\frac{2}{n^2} \le \frac{\ln n}{n}$.

On the other hand, for $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, recall that $\vartheta_0$ beats any $\vartheta$ iff (2) holds.
According to $Kw(\vartheta) \ge 0$, $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$, and Lemma 2 (i) and (ii), (2) is already
implied by $|\alpha - \vartheta| \ge \sqrt{\frac{(\ln 2) Kw(\vartheta_0) + \frac83 c^2}{2n}}$. Clearly, a contribution only occurs if $\vartheta$ beats
$\vartheta_0$, therefore if the opposite inequality holds. Using $|\alpha - \vartheta_0| \le \frac{c}{\sqrt{n}}$ again and the
triangle inequality, we obtain that

$$(\vartheta - \vartheta_0)^2 \le \frac{5 c^2 + \frac12 (\ln 2) Kw(\vartheta_0) + \sqrt{2 (\ln 2) Kw(\vartheta_0) c^2}}{n}$$

in this case. Since we have chosen $c = \sqrt{\ln n}$, this implies the assertion. □

One can improve the bound in Proposition 6 to $E(\vartheta_0 - \vartheta^{x_{1:n}})^2 \stackrel{\times}{\le} \frac{Kw(\vartheta_0)}{n}$ by a
refined argument, compare [BC91]. But the high-level assertion is the same: Even if
the cumulative upper bound may tend to infinity, the instantaneous error converges
rapidly to 0. Moreover, the convergence speed depends on $Kw(\vartheta_0)$ as opposed to
$2^{Kw(\vartheta_0)}$. Thus $\vartheta^x$ tends to $\vartheta_0$ rapidly in probability (recall that the assertion is not
strong enough to conclude almost sure convergence). The proof does not exploit
$\sum_\vartheta w_\vartheta \le 1$, but only $w_\vartheta \le 1$, hence the assertion even holds for a maximum likelihood
estimator (i.e. $w_\vartheta = 1$ for all $\vartheta \in \Theta$). The theorem generalizes to i.i.d. classes. For
the example in Proposition 5, the instantaneous bound implies that the bulk of
losses occurs very late. This does not hold for general (non-i.i.d.) model classes:
The losses in [PH04, Example 9] grow linearly in the first $n$ steps.

We will now state our main positive result that upper bounds the cumulative 
loss in terms of the negative logarithm of the true weight and the arrangement of 
the false parameters. We will only give the proof idea - which is similar to that of 
Proposition 5 - and omit the lengthy and tedious technical details. 

Consider the cumulated sum square error $\sum_n E(\vartheta^{(\alpha,n)} - \vartheta_0)^2$. In order to upper
bound this quantity, we will partition the open unit interval $(0, 1)$ into a sequence of
intervals $(I_k)_{k=1}^{\infty}$, each of measure $2^{-k}$. (More precisely: Each $I_k$ is either an interval
or a union of two intervals.) Then we will estimate the contribution of each interval
to the cumulated square error,

$$C(k) = \sum_{n=1}^{\infty} \sum_{\alpha \in A_n,\ \vartheta^{(\alpha,n)} \in I_k} p(\alpha|n) (\vartheta^{(\alpha,n)} - \vartheta_0)^2$$

(compare (3) and (5)). Note that $\vartheta^{(\alpha,n)} \in I_k$ precisely reads $\vartheta^{(\alpha,n)} \in I_k \cap \Theta$, but for
convenience we generally assume $\vartheta \in \Theta$ for all $\vartheta$ being considered. This partitioning
is also used for $\alpha$, i.e. define the contribution $C(k, j)$ of $\vartheta \in I_k$ where $\alpha \in I_j$ as

$$C(k, j) = \sum_{n=1}^{\infty} \sum_{\alpha \in A_n \cap I_j,\ \vartheta^{(\alpha,n)} \in I_k} p(\alpha|n) (\vartheta^{(\alpha,n)} - \vartheta_0)^2.$$

We need to distinguish between $\alpha$ that are located close to $\vartheta_0$ and $\alpha$ that are located
far from $\vartheta_0$. "Close" will be roughly equivalent to $j > k$, "far" will be approximately
$j \le k$. So we get $\sum_n E(\vartheta^{(\alpha,n)} - \vartheta_0)^2 = \sum_k C(k) = \sum_k \sum_j C(k, j)$. In the proof,

$$p(\alpha|n) \stackrel{\times}{\le} [n \alpha (1 - \alpha)]^{-\frac12} \exp[-n D(\alpha \| \vartheta_0)]$$

is often applied, which holds by Lemma 3 (recall that $f \stackrel{\times}{\le} g$ stands for $f = O(g)$).
Terms like $D(\alpha \| \vartheta_0)$, arising in this context and others, can be further estimated
using Lemma 2. We now give the constructions of intervals $I_k$ and complementary
intervals $J_k$.

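Definition 7 Let $\vartheta_0 \in \Theta$ be given. Start with $J_0 = [0, 1)$. Let $J_{k-1} = [\vartheta_k^l, \vartheta_k^r)$
and define $d_k = \vartheta_k^r - \vartheta_k^l = 2^{-k+1}$. Then $I_k, J_k \subset J_{k-1}$ are constructed from $J_{k-1}$
according to the following rules.

$$\vartheta_0 \in [\vartheta_k^l, \vartheta_k^l + \tfrac38 d_k) \Rightarrow J_k = [\vartheta_k^l, \vartheta_k^l + \tfrac12 d_k), \quad I_k = [\vartheta_k^l + \tfrac12 d_k, \vartheta_k^r), \qquad (6)$$

$$\vartheta_0 \in [\vartheta_k^l + \tfrac38 d_k, \vartheta_k^l + \tfrac58 d_k) \Rightarrow J_k = [\vartheta_k^l + \tfrac14 d_k, \vartheta_k^l + \tfrac34 d_k), \quad I_k = [\vartheta_k^l, \vartheta_k^l + \tfrac14 d_k) \cup [\vartheta_k^l + \tfrac34 d_k, \vartheta_k^r), \qquad (7)$$

$$\vartheta_0 \in [\vartheta_k^l + \tfrac58 d_k, \vartheta_k^r) \Rightarrow J_k = [\vartheta_k^l + \tfrac12 d_k, \vartheta_k^r), \quad I_k = [\vartheta_k^l, \vartheta_k^l + \tfrac12 d_k). \qquad (8)$$

[Figure 1: Example of the intervals for $\vartheta_0 = \frac{3}{16}$. We have an l-step, a
c-step, an l-step and another c-step. All following steps will also be c-steps.]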

We call the kth step of the interval construction an l-step if (6) applies, a c-step if 
(7) applies, and an r-step if (8) applies, respectively. Fig. 1 shows an example for 
the interval construction. 
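The interval construction is easy to trace mechanically. The following sketch classifies each step as an l-, c- or r-step using exact rational arithmetic. It rests on two assumptions that should be checked against the definition: the case thresholds are taken to be $\frac38 d_k$ and $\frac58 d_k$, and the example value $\vartheta_0 = \frac{3}{16}$ is an illustrative choice; the function name `interval_steps` is hypothetical.

```python
from fractions import Fraction

def interval_steps(theta0: Fraction, depth: int) -> list[str]:
    """Classify the first `depth` steps of the interval construction for theta0."""
    left, right = Fraction(0), Fraction(1)  # J_0 = [0, 1)
    steps = []
    for _ in range(depth):
        d = right - left                    # d_k, the length of J_{k-1}
        if theta0 < left + Fraction(3, 8) * d:
            steps.append("l")               # rule (6): keep the left half
            right = left + d / 2
        elif theta0 < left + Fraction(5, 8) * d:
            steps.append("c")               # rule (7): keep the central half
            left, right = left + d / 4, left + 3 * d / 4
        else:
            steps.append("r")               # rule (8): keep the right half
            left = left + d / 2
    return steps

print(interval_steps(Fraction(3, 16), 6))
```

Under these assumptions the trace for $\frac{3}{16}$ is l, c, l, c followed by c-steps only, because after the fourth step the parameter sits exactly in the centre of $J_4 = [\frac{5}{32}, \frac{7}{32})$.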

Clearly, this is not the only possible way to define an interval construction.
Maybe the reader wonders why we did not center the intervals around $\vartheta_0$. In fact,
this construction would equally work for the proof. However, its definition would
not be easier, since one still has to treat the case where $\vartheta_0$ is located close to
the boundary. Moreover, our construction has the nice property that the interval
bounds are finite binary fractions. Given the interval construction, we can identify
the $\vartheta \in I_k$ with lowest complexity:

$$\vartheta_k^I = \arg\min\{Kw(\vartheta) : \vartheta \in I_k \cap \Theta\},$$
$$\vartheta_k^J = \arg\min\{Kw(\vartheta) : \vartheta \in J_k \cap \Theta\}, \text{ and}$$
$$\Delta(k) = \max\{Kw(\vartheta_k^I) - Kw(\vartheta_k^J), 0\}.$$

If there is no $\vartheta \in I_k \cap \Theta$, we set $\Delta(k) = Kw(\vartheta_k^I) = \infty$.

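Theorem 8 Let $\Theta \subset [0, 1]$ be countable, $\vartheta_0 \in \Theta$, and $w_\vartheta = 2^{-Kw(\vartheta)}$, where $Kw(\vartheta)$
is some complexity measure on $\Theta$. Let $\Delta(k)$ be as introduced in the last paragraph,
then

$$\sum_{n=0}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} Kw(\vartheta_0) + \sum_{k=1}^{\infty} 2^{-\Delta(k)} \sqrt{\Delta(k)}.$$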

The proof is omitted. But we briefly discuss the assertion of this theorem. It
states an error bound in terms of the arrangement of the false parameters which
directly depends on the interval construction. As already indicated, a different
interval construction would do as well, provided that it exponentially contracts to
the true parameter. For a reasonable distribution of parameters, we might expect
that $\Delta(k)$ increases linearly for $k$ large enough, and thus $\sum_k 2^{-\Delta(k)} \sqrt{\Delta(k)}$ remains
bounded. In the next section, we identify cases where this holds.

5 Uniformly Distributed Weights



We are now able to state some positive results following from Theorem 8. 

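Theorem 9 Let $\Theta \subset [0, 1]$ be a countable class of parameters and $\vartheta_0 \in \Theta$ the true
parameter. Assume that there are constants $a \ge 1$ and $b \ge 0$ such that

$$\min\{Kw(\vartheta) : \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta,\ \vartheta \ne \vartheta_0\} \ge \frac{k - b}{a} \qquad (9)$$

holds for all $k > a Kw(\vartheta_0) + b$. Then we have

$$\sum_{n=0}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} a Kw(\vartheta_0) + b \stackrel{\times}{\le} Kw(\vartheta_0).$$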

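Proof. We have to show that

$$\sum_{k=1}^{\infty} 2^{-\Delta(k)} \sqrt{\Delta(k)} \stackrel{\times}{\le} a Kw(\vartheta_0) + b,$$

then the assertion follows from Theorem 8. Let $k_1 = \lceil a Kw(\vartheta_0) + b + 1 \rceil$ and $k' =
k - k_1$. It is not hard to see that $\max_{\vartheta \in I_k} |\vartheta - \vartheta_0| \le 2^{-k+1}$ holds. Together with (9),
this implies

$$\sum_{k=1}^{\infty} 2^{-\Delta(k)} \sqrt{\Delta(k)} \le \sum_{k=1}^{k_1} 1 + \sum_{k=k_1+1}^{\infty} 2^{-Kw(\vartheta_k^I) + Kw(\vartheta_0)} \sqrt{Kw(\vartheta_k^I) - Kw(\vartheta_0)}$$

$$\le \ldots \le a Kw(\vartheta_0) + b + 2 + \sum_{k'=1}^{\infty} 2^{-\frac{k'}{a}} \sqrt{\tfrac{k'}{a} + Kw(\vartheta_0)}.$$

Observe $\sqrt{\frac{k'}{a} + Kw(\vartheta_0)} \le \sqrt{\frac{k'}{a}} + \sqrt{Kw(\vartheta_0)}$, $\sum_{k'} 2^{-\frac{k'}{a}} \stackrel{\times}{\le} a$, and by Lemma 4 (i),
$\sum_{k'} 2^{-\frac{k'}{a}} \sqrt{\frac{k'}{a}} \stackrel{\times}{\le} a$. Then the assertion follows. □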



Letting $j = \frac{k - b}{a}$, (9) asserts that parameters $\vartheta$ with complexity $Kw(\vartheta) = j$ must
have a minimum distance of $2^{-ja-b}$ from $\vartheta_0$. That is, if parameters with equal
weights are (approximately) uniformly distributed in the neighborhood of $\vartheta_0$, in the
sense that they are not too close to each other, then fast convergence holds. The
next two results are special cases based on the set of all finite binary fractions,

$$Q_{\mathbb{B}^*} = \{\vartheta = 0.\beta_1 \beta_2 \ldots \beta_{n-1} 1 : n \in \mathbb{N},\ \beta_i \in \mathbb{B}\} \cup \{0, 1\}.$$

If $\vartheta = 0.\beta_1 \beta_2 \ldots \beta_{n-1} 1 \in Q_{\mathbb{B}^*}$, its length is $\ell(\vartheta) = n$. Moreover, there is a
binary code $\beta'_1 \ldots \beta'_{n'}$ for $n$, having at most $n' \le \lfloor \log_2 (n + 1) \rfloor$ bits. Then
$0 \beta'_1 0 \beta'_2 \ldots 0 \beta'_{n'} 1 \beta_1 \ldots \beta_{n-1}$ is a prefix-code for $\vartheta$. For completeness, we can define
the codes for $\vartheta = 0, 1$ to be 10 and 11, respectively. So we may define a complexity
measure on $Q_{\mathbb{B}^*}$ by

$$Kw(0) = 2, \quad Kw(1) = 2, \quad \text{and} \quad Kw(\vartheta) = \ell(\vartheta) + 2 \lfloor \log_2 (\ell(\vartheta) + 1) \rfloor \text{ for } \vartheta \ne 0, 1. \qquad (10)$$

There are other similar simple prefix codes on $Q_{\mathbb{B}^*}$ such that $Kw(\vartheta) \ge \ell(\vartheta)$.
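The code behind (10) can be written down directly. In the sketch below a binary fraction is represented by its digit string after the point (ending in 1). As one possible choice for the binary code of the length $n$, it uses the binary expansion of $n + 1$ without its leading 1, which has exactly $\lfloor \log_2 (n + 1) \rfloor$ bits; this choice and the names `encode` and `kw` are assumptions made for illustration.

```python
from itertools import product

def encode(bits: str) -> str:
    """Codeword for theta = 0.bits, where `bits` ends in '1' and len(bits) = l(theta)."""
    n = len(bits)
    length_code = bin(n + 1)[3:]                        # n + 1 in binary, leading 1 dropped
    header = "".join("0" + b for b in length_code)      # each length bit is preceded by 0
    return header + "1" + bits[:-1]                     # 1 ends the header; final 1 is implicit

def kw(bits: str) -> int:
    """Complexity from (10): l(theta) + 2 * floor(log2(l(theta) + 1))."""
    n = len(bits)
    return n + 2 * ((n + 1).bit_length() - 1)

# All binary fractions of length at most 8, plus the special codes for 0 and 1.
fractions = ["".join(p) + "1" for n in range(1, 9) for p in product("01", repeat=n - 1)]
codes = [encode(b) for b in fractions] + ["10", "11"]
print(encode("1"), encode("011"), sum(2.0 ** -len(c) for c in codes))
```

A decoder reads bit pairs until a pair starts with 1, which is why no codeword can be a prefix of another; the Kraft sum over any finite subset therefore stays below one.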

Corollary 10 Let $\Theta = Q_{\mathbb{B}^*}$, $\vartheta_0 \in \Theta$ and $Kw(\vartheta) \ge \ell(\vartheta)$, then $\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le}
Kw(\vartheta_0)$ holds.

The proof is trivial, since Condition (9) holds with $a = 1$ and $b = 0$. This is a
special case of a uniform distribution of parameters with equal complexities. The
next corollary is more general; it proves fast convergence if the uniform distribution
is distorted by some function $\varphi$.



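Corollary 11 Let $\varphi : [0, 1] \to [0, 1]$ be an injective, $N$ times continuously differentiable
function. Let $\Theta = \varphi(Q_{\mathbb{B}^*})$, $Kw(\varphi(t)) \ge \ell(t)$ for all $t \in Q_{\mathbb{B}^*}$, and $\vartheta_0 = \varphi(t_0)$
for a $t_0 \in Q_{\mathbb{B}^*}$. Assume that there is $n \le N$ and $\varepsilon > 0$ such that

$$\left| \frac{d^n \varphi}{dt^n}(t) \right| \ge c > 0 \text{ for all } t \in [t_0 - \varepsilon, t_0 + \varepsilon] \quad \text{and}$$

$$\frac{d^m \varphi}{dt^m}(t_0) = 0 \text{ for all } 1 \le m < n.$$

Then we have

$$\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} n Kw(\vartheta_0) + 2 \log_2 (n!) - 2 \log_2 c + n \log_2 \varepsilon \stackrel{\times}{\le} n Kw(\vartheta_0).$$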



Proof. Fix $j > Kw(\vartheta_0)$, then

$$Kw(\varphi(t)) \ge j \text{ for all } t \in [t_0 - 2^{-j}, t_0 + 2^{-j}] \cap Q_{\mathbb{B}^*}. \qquad (11)$$

Moreover, for all $t \in [t_0 - 2^{-j}, t_0 + 2^{-j}]$, Taylor's theorem asserts that

$$\varphi(t) = \varphi(t_0) + \frac{\frac{d^n \varphi}{dt^n}(\tilde t)}{n!} (t - t_0)^n \qquad (12)$$

for some $\tilde t$ in $(t_0, t)$ (or $(t, t_0)$ if $t < t_0$). We request in addition $2^{-j} \le \varepsilon$, then
$|\frac{d^n \varphi}{dt^n}(\tilde t)| \ge c$ by assumption. Apply (12) to $t = t_0 + 2^{-j}$ and $t = t_0 - 2^{-j}$ and define
$k = \lceil jn + \log_2 (n!) - \log_2 c \rceil$ in order to obtain $|\varphi(t_0 + 2^{-j}) - \vartheta_0| \ge 2^{-k}$ and
$|\varphi(t_0 - 2^{-j}) - \vartheta_0| \ge 2^{-k}$. By injectivity of $\varphi$, we see that $\varphi(t) \notin [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}]$ if
$t \notin [t_0 - 2^{-j}, t_0 + 2^{-j}]$. Together with (11), this implies

$$Kw(\vartheta) \ge j \ge \frac{k - \log_2 (n!) + \log_2 c - 1}{n} \text{ for all } \vartheta \in [\vartheta_0 - 2^{-k}, \vartheta_0 + 2^{-k}] \cap \Theta.$$

This is condition (9) with $a = n$ and $b = \log_2 (n!) - \log_2 c + 1$. Finally, the assumption
$2^{-j} \le \varepsilon$ holds if $k \ge k_1 = n \log_2 \varepsilon + \log_2 (n!) - \log_2 c + 1$. This gives an additional
contribution to the error of at most $k_1$. □

Corollary 11 shows an implication of Theorem 8 for parameter identification:
A class of models is given by a set of parameters $Q_{\mathbb{B}^*}$ and a mapping $\varphi : Q_{\mathbb{B}^*} \to
\Theta$. The task is to identify the true parameter $t_0$ or its image $\vartheta_0 = \varphi(t_0)$. The
injectivity of $\varphi$ is not necessary for fast convergence, but it facilitates the proof.
The assumptions of Corollary 11 are satisfied if $\varphi$ is for example a polynomial. In
fact, it should be possible to prove fast convergence of MDL for many common
parameter identification problems. For sets of parameters other than $Q_{\mathbb{B}^*}$, e.g. the
set of all rational numbers $\mathbb{Q}$, similar corollaries can easily be proven.

How large is the constant hidden in "$\stackrel{\times}{\le}$"? When examining carefully the proof
of Theorem 8, the resulting constant is quite large. This is mainly due to the
frequent "wasting" of small constants. Presumably a smaller bound holds as well,
perhaps 16. On the other hand, for the actual true expectation (as opposed to
its upper bound) and complexities as in (10), numerical simulations indicate that
$\sum_n E(\vartheta_0 - \vartheta^x)^2 \le \frac12 Kw(\vartheta_0)$.

Finally, we state an implication which almost trivially follows from Theorem 8,
since there $\sum_k 2^{-\Delta(k)} \sqrt{\Delta(k)} \stackrel{\times}{\le} N$ is obvious. However, it may be very useful for
practical purposes, e.g. for hypothesis testing.

Corollary 12 Let $\Theta$ contain $N$ elements, $Kw(\cdot)$ be any complexity function on $\Theta$,
and $\vartheta_0 \in \Theta$. Then we have

$$\sum_{n=1}^{\infty} E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} N + Kw(\vartheta_0).$$

6 The Universal Case 

We briefly discuss the important universal setup, where $Kw(\cdot)$ is (up to an additive
constant) equal to the prefix Kolmogorov complexity $K$ (that is the length of the
shortest self-delimiting program printing $\vartheta$ on some universal Turing machine). Since
$\sum_k 2^{-K(k)} \sqrt{K(k)} = \infty$ no matter how late the sum starts (otherwise there would be
a shorter code for large $k$), we cannot apply Theorem 8. This means in particular
that we do not even obtain our previous result, Theorem 1. But probably the
following strengthening of the theorem holds under the same conditions, which then
easily implies Theorem 1 up to a constant.

Conjecture 13 $\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + \sum_k 2^{-\Delta(k)}$.

Then, take an incompressible finite binary fraction $\vartheta_0 \in Q_{\mathbb{B}^*}$, i.e. $K(\vartheta_0) \stackrel{+}{=}
\ell(\vartheta_0) + K(\ell(\vartheta_0))$, where $\stackrel{+}{=}$ and $\stackrel{+}{\ge}$ denote (in)equality up to an additive constant.
For $k > \ell(\vartheta_0)$, we can reconstruct $\vartheta_0$ and $k$ from $\vartheta_k^I$ and $\ell(\vartheta_0)$ by
just truncating $\vartheta_k^I$ after $\ell(\vartheta_0)$ bits. Thus $K(\vartheta_k^I) + K(\ell(\vartheta_0)) \stackrel{+}{\ge} K(\vartheta_0) + K(k | \vartheta_0, K(\vartheta_0))$
holds. Using Conjecture 13, we obtain

$$\sum_n E(\vartheta_0 - \vartheta^x)^2 \stackrel{\times}{\le} K(\vartheta_0) + 2^{K(\ell(\vartheta_0))} \stackrel{\times}{\le} \ell(\vartheta_0) (\log_2 \ell(\vartheta_0))^2, \qquad (13)$$

where the last inequality follows from the example coding given in (10). So, under
Conjecture 13, we obtain a bound which slightly exceeds the complexity $K(\vartheta_0)$ if $\vartheta_0$
has a certain structure. It is not obvious if the same holds for all computable $\vartheta_0$. In
order to answer this question positively, one could try to use something like [Gac83,
Eq.(2.1)]. This statement implies that as soon as $K(k) \ge K_1$ for all $k \ge k_1$, we
have $\sum_{k \ge k_1} 2^{-K(k)} \stackrel{\times}{\le} 2^{-K_1} K_1 (\log_2 K_1)^2$. It is possible to prove an analogous result
for $\vartheta_k^I$ instead of $k$, however we have not found an appropriate coding that does
without knowing $\vartheta_0$. Since the resulting bound is exponential in the code length,
we therefore have not gained anything.

Another problem concerns the size of the multiplicative constant that is hidden
in the upper bound. Unlike in the case of uniformly distributed weights, it is now of
exponential size, i.e. $2^{O(1)}$. This is no artifact of the proof, as the following example
shows.

Example 14 Let $U$ be some universal Turing machine. We construct a second
universal Turing machine $U'$ from $U$ as follows: Let $N \ge 1$. If the input of $U'$ is
$1^N p$, where $1^N$ is the string consisting of $N$ ones and $p$ is some program, then $U$
will be executed on $p$. If the input of $U'$ is $0^N$, then $U'$ outputs $\frac12$. Otherwise, if the
input of $U'$ is $x$ with $x \in \mathbb{B}^N \setminus \{0^N, 1^N\}$, then $U'$ outputs $\frac12 + 2^{-x-1}$. For $\vartheta_0 = \frac12$,
the conditions of a slight generalization of Proposition 5 are satisfied (where the
complexity is relative to $U'$), thus $\sum_n E(\vartheta^x - \vartheta_0)^2 \stackrel{\times}{\ge} 2^N$.

Can this also happen if the underlying universal Turing machine is not "strange" 
in some sense, like U', but "natural"? Again this is not obvious. One would have 
to define first a "natural" universal Turing machine which rules out cases like U' . If 
N is not too large, then one can even argue that U' is natural in the sense that its 
compiler constant relative to U is small. 

There is a relation to the class of all deterministic (generally non-i.i.d.) measures.
For this setup, MDL predicts the next symbol just according to the monotone
complexity $Km$, see [Hut03b]. According to [Hut03b, Theorem 5], $2^{-Km}$ is very close
to the universal semimeasure $M$ (this is due to [ZL70]). Then the total prediction
error (which is defined slightly differently in this case) can be shown to be bounded
by $2^{O(1)}$ times a polynomial in $Km(x_{<\infty})$ [Hut04]. The similarity to the (unproven)
bound (13) "huge constant × polynomial" for the universal Bernoulli case is evident.

7 Discussion and Conclusions



We have discovered the fact that the instantaneous and the cumulative loss bounds
can be incompatible. On the one hand, the cumulative loss for MDL predictions
may be exponential, i.e. $2^{Kw(\vartheta_0)}$. Thus it implies almost sure convergence at a
slow rate, even for arbitrary discrete model classes [PH04]. On the other hand,
the instantaneous loss is always of order $\frac{1}{n} Kw(\vartheta_0)$, implying fast convergence in
probability and a cumulative loss bound of $Kw(\vartheta_0) \ln n$. Similar logarithmic loss
bounds can be found in the literature for continuous model classes [Ris96].

A different approach to assess convergence speed is presented in [BC91]. There,
an index of resolvability is introduced, which can be interpreted as the difference of
the expected MDL code length and the expected code length under the true model.
For discrete model classes, they show that the index of resolvability converges to
zero as $\frac{1}{n} Kw(\vartheta_0)$ [BC91, Equation (6.2)]. Moreover, they give a convergence of
the predictive distributions in terms of the Hellinger distance [BC91, Theorem 4].
This implies a cumulative (Hellinger) loss bound of $Kw(\vartheta_0) \ln n$ and therefore fast
convergence in probability.

If the prior weights are arranged nicely, we have proven a small finite loss bound
$Kw(\vartheta_0)$ for MDL (Theorem 8). If parameters of equal complexity are uniformly
distributed or not too strongly distorted (Theorem 9 and Corollaries), then the error
is within a small multiplicative constant of the complexity $Kw(\vartheta_0)$. This may be
applied e.g. for the case of parameter identification (Corollary 11). A similar result
holds if $\Theta$ is finite and contains only few parameters (Corollary 12), which may be
e.g. satisfied for hypothesis testing. In these cases and many others, one can interpret
the conditions for fast convergence as the presence of prior knowledge. One can show
that if a predictor converges to the correct model, then it performs also well under
arbitrarily chosen bounded loss-functions [Hut03a, Theorem 4]. Moreover, we can
then conclude good properties for other machine learning tasks such as classification,
as discussed in the introduction. From an information theoretic viewpoint one may
interpret the conditions for a small bound in Theorem 8 as "good codes".

The main restriction of our positive result is the fact that we have proved it
only for the Bernoulli case. We therefore argue that it generalizes to arbitrary
i.i.d. settings. Let $\vartheta_0 \in [0, 1]^N$, $\sum_i \vartheta_0^i = 1$ be a probability vector that generates
sequences of i.i.d. samples in $\{1, \ldots, N\}^\infty$. Assume that $\vartheta_0$ stays away from the
boundary (the other case is treated similarly). Then we can define a sequence of
nested sets in dimension $N - 1$ in analogy to the interval construction. The main
points of the proof are now the following two: First, for an observed parameter $\alpha$
far from $\vartheta_0$, the probability of $\alpha$ decays exponentially, and second, for $\alpha$ close to $\vartheta_0$,
some $\vartheta$ far from $\vartheta_0$ can contribute at most for a short time. These facts hold in the
general i.i.d. case like in the Bernoulli case. However, the rigorous proof of it is yet
more complicated and technical than for the Bernoulli case. (Compare the proof of
the main result in [Ris96].)

We conclude with an open question. In abstract terms, we have proven a convergence
result for the Bernoulli (or i.i.d.) case by mainly exploiting the geometry
of the space of distributions. This is in principle very easy, since for Bernoulli this
space is just the unit interval, for i.i.d. it is the space of probability vectors. It is not
obvious how (or if at all) this approach can be transferred to general (computable)
measures.